72 research outputs found

    DETECTING GENETIC ENGINEERING WITH A KNOWLEDGE-RICH DNA SEQUENCE CLASSIFIER

    Get PDF
    Detecting evidence of genetic engineering in the wild is a problem of growing importance for biosecurity, provenance, and intellectual property rights. This thesis describes a computational system designed to detect engineering from DNA sequencing of biological samples and presents its performance on fully blinded test data. The pipeline builds on existing computational resources for metagenomics, including methods that use the full set of reference genomes deposited in GenBank. Starting from raw reads generated from short-read sequencers, the dominant host species are identified by k-mer analysis. Next, all the sequencing reads are mapped to the imputed host strain; those reads that do not map are retained as suspicious. Suspicious reads are de novo assembled to suspicious contigs, followed by sequence alignment against the NCBI non-redundant nucleotide database to annotate the engineered sequence and to identify whether the engineering is in a plasmid or is integrated into the host genome. Our initial system applied to blinded samples provides excellent identification of foreign gene content, the changes most likely to be functional. We have less ability to detect functional structural variants and small indels and SNPs produced by genetic engineering but which are more difficult to distinguish from natural variation. Future work will focus on improved methods for detecting synonymous recoding, used to introduce watermarks and for compatibility with synthesis and assembly methods, for using long read sequence data, and for distinguishing engineered sequence from natural variation

    pTSE: A Multi-model Ensemble Method for Probabilistic Time Series Forecasting

    Full text link
    Various probabilistic time series forecasting models have sprung up and shown remarkably good performance. However, the choice of model highly relies on the characteristics of the input time series and the fixed distribution that the model is based on. Due to the fact that the probability distributions cannot be averaged over different models straightforwardly, the current time series model ensemble methods cannot be directly applied to improve the robustness and accuracy of forecasting. To address this issue, we propose pTSE, a multi-model distribution ensemble method for probabilistic forecasting based on Hidden Markov Model (HMM). pTSE only takes off-the-shelf outputs from member models without requiring further information about each model. Besides, we provide a complete theoretical analysis of pTSE to prove that the empirical distribution of time series subject to an HMM will converge to the stationary distribution almost surely. Experiments on benchmarks show the superiority of pTSE overall member models and competitive ensemble methods.Comment: The 32nd International Joint Conference on Artificial Intelligence (IJCAI 2023

    A Survey of Source Code Search: A 3-Dimensional Perspective

    Full text link
    (Source) code search is widely concerned by software engineering researchers because it can improve the productivity and quality of software development. Given a functionality requirement usually described in a natural language sentence, a code search system can retrieve code snippets that satisfy the requirement from a large-scale code corpus, e.g., GitHub. To realize effective and efficient code search, many techniques have been proposed successively. These techniques improve code search performance mainly by optimizing three core components, including query understanding component, code understanding component, and query-code matching component. In this paper, we provide a 3-dimensional perspective survey for code search. Specifically, we categorize existing code search studies into query-end optimization techniques, code-end optimization techniques, and match-end optimization techniques according to the specific components they optimize. Considering that each end can be optimized independently and contributes to the code search performance, we treat each end as a dimension. Therefore, this survey is 3-dimensional in nature, and it provides a comprehensive summary of each dimension in detail. To understand the research trends of the three dimensions in existing code search studies, we systematically review 68 relevant literatures. Different from existing code search surveys that only focus on the query end or code end or introduce various aspects shallowly (including codebase, evaluation metrics, modeling technique, etc.), our survey provides a more nuanced analysis and review of the evolution and development of the underlying techniques used in the three ends. Based on a systematic review and summary of existing work, we outline several open challenges and opportunities at the three ends that remain to be addressed in future work.Comment: submitted to ACM Transactions on Software Engineering and Methodolog

    A versatile route to fabricate single atom catalysts with high chemoselectivity

    Get PDF
    Preparation of single atom catalysts (SACs) is of broad interest to materials scientists and chemists but remains a formidable challenge. Herein, we develop an efficient approach to synthesize SACs via a precursor-dilution strategy, in which metalloporphyrin (MTPP) with target metals are co-polymerized with diluents (tetraphenylporphyrin, TPP), followed by pyrolysis to N-doped porous carbon supported SACs (M1/N-C). Twenty-four different SACs, including noble metals and non-noble metals, are successfully prepared. In addition, the synthesis of a series of catalysts with different surface atom densities, bi-metallic sites, and metal aggregation states are achieved. This approach shows remarkable adjustability and generality, providing sufficient freedom to design catalysts at atomic-scale and explore the unique catalytic properties of SACs. As an example, we show that the prepared Pt1/N-C exhibits superior chemoselectivity and regioselectivity in hydrogenation. It only converts terminal alkynes to alkenes while keeping other reducible functional groups such as alkenyl, nitro group, and even internal alkyne intact

    Major data analysis errors invalidate cancer microbiome findings

    Get PDF
    We re-analyzed the data from a recent large-scale study that reported strong correlations between DNA signatures of microbial organisms and 33 different cancer types and that created machine-learning predictors with near-perfect accuracy at distinguishing among cancers. We found at least two fundamental flaws in the reported data and in the methods: (i) errors in the genome database and the associated computational methods led to millions of false-positive findings of bacterial reads across all samples, largely because most of the sequences identified as bacteria were instead human; and (ii) errors in the transformation of the raw data created an artificial signature, even for microbes with no reads detected, tagging each tumor type with a distinct signal that the machine-learning programs then used to create an apparently accurate classifier. Each of these problems invalidates the results, leading to the conclusion that the microbiome-based classifiers for identifying cancer presented in the study are entirely wrong. These flaws have subsequently affected more than a dozen additional published studies that used the same data and whose results are likely invalid as well

    Genetic Diversity and Linkage Disequilibrium in Chinese Bread Wheat (Triticum aestivum L.) Revealed by SSR Markers

    Get PDF
    Two hundred and fifty bread wheat lines, mainly Chinese mini core accessions, were assayed for polymorphism and linkage disequilibrium (LD) based on 512 whole-genome microsatellite loci representing a mean marker density of 5.1 cM. A total of 6,724 alleles ranging from 1 to 49 per locus were identified in all collections. The mean PIC value was 0.650, ranging from 0 to 0.965. Population structure and principal coordinate analysis revealed that landraces and modern varieties were two relatively independent genetic sub-groups. Landraces had a higher allelic diversity than modern varieties with respect to both genomes and chromosomes in terms of total number of alleles and allelic richness. 3,833 (57.0%) and 2,788 (41.5%) rare alleles with frequencies of <5% were found in the landrace and modern variety gene pools, respectively, indicating greater numbers of rare variants, or likely new alleles, in landraces. Analysis of molecular variance (AMOVA) showed that A genome had the largest genetic differentiation and D genome the lowest. In contrast to genetic diversity, modern varieties displayed a wider average LD decay across the whole genome for locus pairs with r2>0.05 (P<0.001) than the landraces. Mean LD decay distance for the landraces at the whole genome level was <5 cM, while a higher LD decay distance of 5–10 cM in modern varieties. LD decay distances were also somewhat different for each of the 21 chromosomes, being higher for most of the chromosomes in modern varieties (<5∼25 cM) compared to landraces (<5∼15 cM), presumably indicating the influences of domestication and breeding. This study facilitates predicting the marker density required to effectively associate genotypes with traits in Chinese wheat genetic resources

    Identification and Characterization of a Leucine-Rich Repeat Kinase 2 (LRRK2) Consensus Phosphorylation Motif

    Get PDF
    Mutations in LRRK2 (leucine-rich repeat kinase 2) have been identified as major genetic determinants of Parkinson's disease (PD). The most prevalent mutation, G2019S, increases LRRK2's kinase activity, therefore understanding the sites and substrates that LRRK2 phosphorylates is critical to understanding its role in disease aetiology. Since the physiological substrates of this kinase are unknown, we set out to reveal potential targets of LRRK2 G2019S by identifying its favored phosphorylation motif. A non-biased screen of an oriented peptide library elucidated F/Y-x-T-x-R/K as the core dependent substrate sequence. Bioinformatic analysis of the consensus phosphorylation motif identified several novel candidate substrates that potentially function in neuronal pathophysiology. Peptides corresponding to the most PD relevant proteins were efficiently phosphorylated by LRRK2 in vitro. Interestingly, the phosphomotif was also identified within LRRK2 itself. Autophosphorylation was detected by mass spectrometry and biochemical means at the only F-x-T-x-R site (Thr 1410) within LRRK2. The relevance of this site was assessed by measuring effects of mutations on autophosphorylation, kinase activity, GTP binding, GTP hydrolysis, and LRRK2 multimerization. These studies indicate that modification of Thr1410 subtly regulates GTP hydrolysis by LRRK2, but with minimal effects on other parameters measured. Together the identification of LRRK2's phosphorylation consensus motif, and the functional consequences of its phosphorylation, provide insights into downstream LRRK2-signaling pathways
    • …
    corecore